I. INTRODUCTION

Randomized benchmarking protocols p] are a promis- 
ing approach to the experimental assessment and evalua- 
tion of quantum information processing proposals. They 
are actively used to benchmark quantum information 
processing proposals [3HI]. The advantages over other 
methods include the independence from the physical im- 
plementation details of those quantum information pro- 
cessing systems being tested [21 0], and scalability. A 
randomized benchmarking protocol may be described as 
a repeated application of a set of randomly chosen Clifford
operations, followed by a measurement. Access to time-optimal
implementations of Clifford operations reduces the time
required to perform a given benchmarking experiment, and is
thus important for present practical purposes.

A goal of an experimentalist desiring to employ a ran- 
domized benchmarking protocol is to construct a com- 
plete set of physically implementable operations that can 
be used to generate any Clifford operation, and then 
be able to express any Clifford operation using the set 
of such implementable operations available. Those
implementable operations are hereafter referred to as
elementary operations. To illustrate, in [2] the set of
elementary operations consists of the two-qubit phase 
gate (controlled-Z) and all single qubit Clifford and Pauli 
gates. In @], the two-qubit ZZ-interactions are provided 
by the driving Hamiltonian, single qubit gates in the X- 
Y plane are implemented as RF pulses, and single qubit 
gates in the Z-plane are implemented through a frame
change, and require no physical action (as such, they
are "free of charge"). The amount of physical resources
required to implement each elementary gate, as well as 
the very set of gates that may be implemented directly, 
varies from one quantum information processing proposal 
to another. Unable to capture all possible elementary
gate libraries and circuit cost metrics, we concentrate on
the study of quantum circuits composed of the Hadamard
gate, the Phase gate (and its inverse), and the two-qubit
CNOT gate, and on two simple metrics of circuit cost: the
gate count and the depth. However, we designed our
algorithms and implementation such that they may be 
modified to accommodate essentially any gate library, as 
well as more sophisticated metrics of the circuit cost. 

In particular, we study the problem of the optimal syn- 
thesis of Clifford operations acting on a small number of 
qubits. We determine the cost of the overall Clifford 
operation based on the number of single and two-qubit 
elementary operations required to implement it. This 
constitutes a simple measure for estimating the difficulty 
of implementing Clifford operations in an experiment. 
We synthesize optimal Clifford circuits on two to four 
qubits, and optimal Clifford circuits on five qubits up to 
input/output permutation. We use the optimal imple- 
mentations of Clifford operations acting on a small num- 
ber of qubits in peep-hole optimization [5] of larger Clif- 
ford circuits. The experiments reveal substantial practi- 
cal improvement in large-scale designs of Clifford circuits. 
Finally, we apply the ideas developed in the paper to find 
an optimal encoding circuit for the five-qubit error cor- 
recting code. This method can be applied to synthesize 
encoding circuits for other error correcting codes that use 
a small number of qubits. 

Stabilizer circuits have been well studied in the
relevant literature: [3] reports an 11-stage layered
decomposition of the n-qubit Clifford operations using at
most $O(n^2/\log_2 n)$ gates; [7] develops linear depth
implementations. Both papers report implementations that are
asymptotically optimal but suboptimal in the absolute sense.
As reported in [2], finding optimal implementations of
Clifford circuits with up to two qubits is straightforward.
In our paper, we report optimal Clifford circuits for up to
four qubits, optimal Clifford circuits up to input/output
permutation for up to five qubits, and optimize scalable
implementations of the Clifford circuits by a factor of
roughly two.

II. PRELIMINARIES

Clifford quantum circuits consist of Hadamard (H), 
Phase (P) and CNOT gates. The important property of 
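these gates is that they map Pauli matrices

$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$$

and their tensor products to themselves by conjugation.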
More precisely: 

$$H X H^\dagger = Z, \quad H Y H^\dagger = -Y, \quad H Z H^\dagger = X,$$
$$P X P^\dagger = Y, \quad P Y P^\dagger = -X, \quad P Z P^\dagger = Z.$$

The CNOT gate acts on two qubits and its action is:

$$X \otimes I \mapsto X \otimes X, \quad Z \otimes I \mapsto Z \otimes I,$$
$$I \otimes X \mapsto I \otimes X, \quad I \otimes Z \mapsto Z \otimes Z.$$

Compact representation of any unitary that can be 
computed by a Clifford circuit is a direct consequence of 
the Clifford gates' property described above. Action of a 
circuit on any input is uniquely defined by this
representation [8]. Taking into account the identity Y = iXZ, it
suffices to know the action by conjugation of the n-qubit 
circuit on 2n Pauli matrices. The result of the application 
of the circuit to each Pauli matrix can be encoded using 
2n + 1 bits [6]. Pauli matrices are encoded as follows: 

$$I \sim (0|0), \quad X \sim (1|0), \quad Z \sim (0|1), \quad Y \sim (1|1)$$

It is convenient to separate X and Z parts when encoding 
larger circuits: 

$$I \sim (0|0), \quad X \sim (1|0), \quad I \otimes X \sim (01|00)$$

One additional bit is used to encode the overall sign. For
any unitary the sign can be adjusted by applying a round of
Pauli gates at the end of the computation. In most of our
applications this can be done for free. In what follows we
consider only the 2n x 2n part of the encoding matrix.
Commutativity relations between Pauli matrices are preserved
under conjugation and induce an additional constraint on the
encoding matrix: it must be symplectic. Furthermore, the
canonical decomposition theorem [5] shows that any binary
symplectic matrix encodes some Clifford circuit.

TABLE I: G - group (Sp - symplectic part of the Clifford group, GL - group generated by linear reversible circuits); n - number of qubits; N_G - size of the corresponding group; Size_G^r - lower bound on the size of the database taking into account input/output renaming (GB).

The tableau representation can be efficiently updated
[8] when adding new gates to the end of an existing circuit.
Adding a gate requires updating one or two columns of the
encoding matrix. The application of the Phase gate to qubit k
corresponds to the addition modulo 2 of column k to column
n + k, the Hadamard gate on qubit k corresponds to exchanging
columns k and n + k, and the CNOT gate with control k and
target j corresponds to the addition of column k to column j
and the addition of column n + j to column n + k. An empty
Clifford circuit corresponds to the identity matrix. These
rules suffice to determine the 2n x 2n binary symplectic
matrix encoding the unitary computed by a given Clifford
circuit.
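The column-update rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, each column of the 2n x 2n matrix is assumed to be stored as an integer bitmask (bit r holds the entry in row r), and the sign bit is ignored.

```python
def identity(n):
    """Empty circuit: the 2n x 2n identity, stored as 2n column bitmasks."""
    return [1 << c for c in range(2 * n)]

def hadamard(cols, n, k):
    """Hadamard on qubit k: exchange columns k and n + k."""
    cols = list(cols)
    cols[k], cols[n + k] = cols[n + k], cols[k]
    return cols

def phase(cols, n, k):
    """Phase gate on qubit k: add column k to column n + k (mod 2)."""
    cols = list(cols)
    cols[n + k] ^= cols[k]
    return cols

def cnot(cols, n, k, j):
    """CNOT with control k and target j: add column k to column j,
    and column n + j to column n + k (mod 2)."""
    cols = list(cols)
    cols[j] ^= cols[k]
    cols[n + k] ^= cols[n + j]
    return cols

def row(cols, r):
    """Row r written as 'x-part|z-part', i.e. the image of one Pauli generator."""
    n = len(cols) // 2
    bits = "".join(str((c >> r) & 1) for c in cols)
    return bits[:n] + "|" + bits[n:]

# Rows 0..n-1 are the images of X on each qubit, rows n..2n-1 the images of Z.
m = cnot(identity(2), 2, 0, 1)
for r in range(4):
    print(row(m, r))
```

With this encoding the four printed rows are the images of X ⊗ I, I ⊗ X, Z ⊗ I and I ⊗ Z under the CNOT gate, which can be compared against the conjugation rules listed earlier in this section.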

For linear reversible circuits (those composed only of
CNOT gates) it suffices to store only the top left n x n part
of the binary symplectic matrix. The described procedure for
updating columns immediately implies that the binary
symplectic matrix for a linear reversible circuit must be of
the following form:

$$\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}$$

As the matrix must be symplectic we have $A^T B = I$, which
uniquely determines B given A. Therefore, we can store linear
reversible unitaries more efficiently than a generic Clifford
operation.
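The constraint on B can be checked directly. The short derivation below assumes that the symplectic condition is written as $M^T J M = J$ with the standard form J shown; the text does not spell out its convention, so this choice is an assumption. Arithmetic is over GF(2), so signs play no role.

```latex
J = \begin{pmatrix} 0 & I \\ I & 0 \end{pmatrix}, \qquad
M = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}
\quad\Longrightarrow\quad
M^{T} J M = \begin{pmatrix} 0 & A^{T} B \\ B^{T} A & 0 \end{pmatrix}.
% Requiring M^T J M = J gives A^T B = I, hence B = (A^T)^{-1}.
```

Only the n x n block A therefore needs to be stored; B can be recomputed whenever it is needed.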

The two optimality measures that we consider are the 
minimal number of the Clifford gates required and the 
minimal depth of the circuit implementing the given uni- 
tary. For brevity, we call them the gate count and the 
depth of the unitary. Our ideas extend to other optimal- 
ity measures, such as the number or the depth in terms 
of the CNOT gates. 

III. ALGORITHMS 

The main challenge in our approach to finding optimal
circuits is a large search space (Table I). Our algorithm is
based on breadth-first search. The number of distinct
unitaries computed by Clifford circuits grows as
$2^{\Theta(n^2)}$. We address the resulting challenge in
several ways. First, each node of the search tree corresponds
to an equivalence class of unitaries instead of the unitary
itself. Second, we use the meet in the middle technique to
avoid building the full tree [9]. Finally, we use a special
data structure to store the search tree in a compact way.

The equivalence relation we use to reduce the size of 
the search space is the following: two unitaries are equiv- 
alent if they can be computed by circuits that are the 
same up to simultaneous renaming of their inputs and 
outputs. Both gate count and depth of a unitary are 
invariant with respect to such simultaneous renaming.
During the search we store only a canonical representative
of each class. For n inputs this results in a reduction
of the number of unitaries to be stored by a factor of
approximately n!. The number of unitaries corresponding
to the same canonical representative is not always n!, but 
this is the most common case. In particular, the fraction 
of four-qubit unitaries that have less than 24 (= 4!) ele- 
ments in their equivalence class is less than 9.7 x 10~ a . 
To search for five-qubit optimal Clifford circuits we used 
the equivalence relation corresponding to the indepen- 
dent renaming of the inputs and outputs, in other words, 
we ignored SWAP gates. This further shrinks the search 
space, but the results are suboptimal in the scenario when 
SWAP has a non-zero cost. 

The idea of the meet in the middle (MiM) technique is
based on the optimality of subcircuits of any optimal circuit.
Given a database $DB_c$ of all unitaries with cost at most c,
MiM allows us to find optimal circuits for unitaries with cost
at most 2c. Suppose we are looking for an optimal circuit
computing a unitary f with cost c + d ≤ 2c. We can always
split the optimal circuit into two optimal circuits with d
and c gates. Therefore, there always exists a unitary g with
cost d ≤ c such that its composition with f has cost c and is
in our database. We can find g by trying all unitaries from
the database and checking whether g ∘ f is also in the
database. In the worst case, using meet in the middle
increases the time required to find a circuit by a factor
proportional to the size of $DB_c$, in comparison to using
the database $DB_{2c}$. At the same time, meet in the middle
significantly reduces the required memory. For example, in the
case of four qubits the maximal number of gates required is
17 and the size of the database is 14.72 GB. Using the
database with optimal circuits up to 9 gates reduces the
required memory to just 108 MB. Meet in the middle is vital
for the search of optimal five-qubit Clifford circuits up to
input/output permutation. In this case, the size of the full
database would have been about 3.08 × 10^6 GB.
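The meet in the middle lookup can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: it works on plain 2n x 2n binary symplectic matrices without canonical representatives, uses hypothetical function names, and stores matrices as tuples of column bitmasks. At the symplectic level each of the gates H, P and CNOT is its own inverse, so the database is closed under inversion and it suffices to test products of the form f·g.

```python
def neighbours(cols, n):
    """All matrices obtained by appending one H, P or CNOT gate."""
    out = []
    for k in range(n):
        h = list(cols); h[k], h[n + k] = h[n + k], h[k]
        out.append(tuple(h))
        p = list(cols); p[n + k] ^= p[k]
        out.append(tuple(p))
        for j in range(n):
            if j != k:
                c = list(cols); c[j] ^= c[k]; c[n + k] ^= c[n + j]
                out.append(tuple(c))
    return out

def build_database(n, max_cost=None):
    """Breadth-first search: map each matrix to its gate count (up to max_cost)."""
    start = tuple(1 << c for c in range(2 * n))
    cost = {start: 0}
    frontier, depth = [start], 0
    while frontier and (max_cost is None or depth < max_cost):
        depth += 1
        nxt = []
        for m in frontier:
            for nb in neighbours(m, n):
                if nb not in cost:
                    cost[nb] = depth
                    nxt.append(nb)
        frontier = nxt
    return cost

def compose(a, b):
    """Product a*b over GF(2): column c of a*b is a applied to column c of b."""
    out = []
    for col in b:
        acc, r = 0, 0
        while col:
            if col & 1:
                acc ^= a[r]
            col >>= 1
            r += 1
        out.append(acc)
    return tuple(out)

def mim_cost(f, db):
    """Gate count of f using only a database of costs up to c (valid if cost(f) <= 2c)."""
    if f in db:
        return db[f]
    best = None
    for g, cost_g in db.items():
        rest = compose(f, g)          # candidate first half of the circuit
        if rest in db:
            total = cost_g + db[rest]
            if best is None or total < best:
                best = total
    return best

full = build_database(2)                     # every two-qubit matrix
deepest = max(full.values())
half = build_database(2, (deepest + 1) // 2)  # partial database DB_c
print(len(full), len(half))
```

For two qubits the partial database already reproduces every gate count of the full search, while storing only the matrices of cost at most c.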



A. Computing canonical representative 

To find the canonical representative with respect to the
simultaneous renaming of the inputs and outputs we compute
all elements of the equivalence class, encode them as bit
strings and find the minimum. We need to go through all
possible permutations. This is accomplished by applying a
single transposition at each step. Exchanging inputs k and j
of an n-qubit Clifford circuit corresponds to swapping columns
and rows of the binary symplectic matrix. The pair of columns
(k, k + n) must be swapped with (j, j + n), and the pairs of
rows with the same indices must be swapped as well.
Internally we represent each binary matrix as an array of
integers. Each integer corresponds to a column of the binary
symplectic matrix. We precompute the required transpositions
of the bit strings of length 2n and use a lookup table to
speed up the swapping of rows of the binary symplectic
matrix.
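A minimal sketch of this canonicalization is given below. It is an illustration under simplifying assumptions: the names `rename` and `canonical` are hypothetical, all permutations are enumerated with `itertools.permutations` instead of stepping through single transpositions, and the bit permutation is computed directly rather than through a precomputed lookup table.

```python
from itertools import permutations

def rename(cols, n, perm):
    """Relabel qubit k as perm[k] on inputs and outputs simultaneously.

    cols holds the 2n columns as bitmasks (bit r = entry in row r)."""
    def permute_bits(v):
        # Row r moves to perm[r]; row n + r moves to n + perm[r].
        out = 0
        for r in range(n):
            if (v >> r) & 1:
                out |= 1 << perm[r]
            if (v >> (n + r)) & 1:
                out |= 1 << (n + perm[r])
        return out

    new = [0] * (2 * n)
    for k in range(n):
        # The column pair (k, k + n) moves to (perm[k], perm[k] + n).
        new[perm[k]] = permute_bits(cols[k])
        new[n + perm[k]] = permute_bits(cols[n + k])
    return tuple(new)

def canonical(cols, n):
    """Smallest encoding over all n! simultaneous renamings."""
    return min(rename(cols, n, p) for p in permutations(range(n)))

# CNOT(control 0, target 1) and CNOT(control 1, target 0) on two qubits
cnot01 = (1, 3, 12, 8)
cnot10 = (3, 2, 4, 12)
print(canonical(cnot01, 2) == canonical(cnot10, 2))
```

The two CNOT orientations differ only by a relabeling of the qubits, so they share one canonical representative and occupy a single database entry.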

When we allow an independent renaming of the inputs and
outputs we apply a more efficient procedure for canonical
representative computation. In most cases we have $(n!)^2$
representatives corresponding to the same equivalence class.
First we find all n! representatives corresponding to the
different row permutations [10]. Then we store columns k and
k + n together in one bit string and sort the resulting bit
strings using a sorting network. This gives a canonical
representative with respect to column permutation for a fixed
row permutation. Finally, we encode the representative for
each row permutation as a bit string and find the minimum.
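The two-stage procedure can be sketched as follows. This is an illustration only: the function names are hypothetical, columns are again assumed to be stored as bitmasks, and Python's built-in `sorted` stands in for the sorting network mentioned in the text.

```python
from itertools import permutations

def permute_rows(cols, n, perm):
    """Move row r to perm[r] and row n + r to n + perm[r] in every column."""
    out = []
    for v in cols:
        w = 0
        for r in range(n):
            if (v >> r) & 1:
                w |= 1 << perm[r]
            if (v >> (n + r)) & 1:
                w |= 1 << (n + perm[r])
        out.append(w)
    return out

def canonical_independent(cols, n):
    """Canonical form up to independent renaming of inputs and outputs."""
    best = None
    for perm in permutations(range(n)):          # n! row permutations
        rows_done = permute_rows(cols, n, perm)
        # Keep columns k and n + k together and sort the pairs; this fixes
        # the column order without enumerating another n! permutations.
        pairs = sorted((rows_done[k], rows_done[n + k]) for k in range(n))
        candidate = tuple(x for x, _ in pairs) + tuple(z for _, z in pairs)
        if best is None or candidate < best:
            best = candidate
    return best

identity = (1, 2, 4, 8)
swap = (2, 1, 8, 4)   # two-qubit SWAP as a permutation matrix
print(canonical_independent(identity, 2) == canonical_independent(swap, 2))
```

Sorting the column pairs replaces one of the two factors of n!, so only n! candidates are compared instead of (n!)^2. The identity and SWAP fall into the same class, which reflects that SWAP gates are ignored under this equivalence.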

For linear reversible circuits we apply the same idea. 
To exchange two inputs k and j we just need to swap 
columns k and j and rows k and j of the matrix encod- 
ing the circuit. This approach also extends to partially 
specified matrices. 



B. Implementation details 

The main bottleneck in our search is the amount of
memory available. In addition to using the canonical
representation, we tried to minimize the memory overhead
caused by the data structures. Here we describe the details
of the gate count optimal search. The same ideas were adopted
for depth optimal search and can be extended to more general
cost functions. We did not aim to study all possible
optimizations in a systematic way.
We present a set of solutions that allowed us to obtain 
the results in a reasonable amount of time and designed 
our software to be scalable enough to support different 
types of search. 

The possible costs of unitaries belong to a short range of
integer values. For this reason, we introduced a separate
data structure to store unitaries with a fixed cost. We call
it a layer. We build layers one by one. To build layer k we
pick an element of layer k - 1, which we call a parent
unitary. Then we compose it with all possible gates and check
whether the resulting unitary was found earlier. The only
possible costs of the resulting unitary are k, k - 1, or
k - 2. A cost less than k - 2 would contradict the knowledge
that the cost of the parent unitary is indeed k - 1.
Therefore, during the search we need to keep only two
previous layers in memory. We repeat the procedure for all
unitaries in layer k - 1. It can be executed in parallel for
several parent unitaries. Only the addition of the unitaries
with cost k to the corresponding layer must be synchronized.
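The layer construction can be sketched as below. This is a simplified, single-threaded illustration with hypothetical names; it omits canonical representatives and keeps every layer in a list for inspection, although membership is tested only against the two previous layers and the layer being built, as the argument above requires.

```python
def neighbours(cols, n):
    """All matrices obtained by appending one H, P or CNOT gate."""
    out = []
    for k in range(n):
        h = list(cols); h[k], h[n + k] = h[n + k], h[k]
        out.append(tuple(h))
        p = list(cols); p[n + k] ^= p[k]
        out.append(tuple(p))
        for j in range(n):
            if j != k:
                c = list(cols); c[j] ^= c[k]; c[n + k] ^= c[n + j]
                out.append(tuple(c))
    return out

def build_layers(n):
    """Return a list of sets; layers[k] holds the matrices of gate count k."""
    start = tuple(1 << c for c in range(2 * n))
    layers = [{start}]
    while True:
        prev = layers[-1]                                  # layer k - 1
        prev2 = layers[-2] if len(layers) > 1 else set()   # layer k - 2
        new = set()                                        # layer k
        for parent in prev:
            for child in neighbours(parent, n):
                # A child costs k, k - 1 or k - 2, so older layers
                # never need to be consulted.
                if child not in prev and child not in prev2:
                    new.add(child)
        if not new:
            return layers
        layers.append(new)

print([len(layer) for layer in build_layers(1)])
```

In a memory-constrained run, layers older than k - 2 could be written to disk as soon as layer k is started, since they are no longer consulted.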

FIG. 1: The number of optimal Clifford circuits on 2, 3, and 4 qubits per optimal gate count and depth.

FIG. 2: The number of optimal Clifford circuits on 2, 3, and 4 qubits per optimal number of controlled-Z gates.

FIG. 3: Estimated proportion of the 5-qubit Clifford unitaries per optimal gate count (independent input/output renaming allowed).

Finally, we describe how to find a circuit using the
precomputed layers. If we find that a unitary belongs to
layer k, this means that there exists a circuit with k gates
computing the unitary. Therefore, by removing the last gate
in the circuit we obtain an optimal circuit with k - 1 gates,
which corresponds to a unitary with cost k - 1. By composing
the source unitary with all possible gates and checking the
cost of the result we identify the last gate in the optimal
circuit. We proceed further in a similar fashion, until we
reach the canonical representative of the identity. In the
case when we rename inputs and outputs simultaneously we
always get an identity in the end. When renaming of inputs and
outputs is independent we obtain a circuit that is composed
entirely of SWAP gates and represents a permutation of the
inputs.
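The circuit recovery step described above can be sketched as follows. This is an illustration with hypothetical names that works on plain matrices rather than canonical representatives, and it stores costs in one dictionary instead of separate layers. It relies on the fact that, at the symplectic level, each of H, P and CNOT is its own inverse, so applying the last gate once more removes it.

```python
def all_gates(n):
    """Gate labels: ('H', k), ('P', k) and ('CNOT', control, target)."""
    gates = []
    for k in range(n):
        gates += [("H", k), ("P", k)]
        gates += [("CNOT", k, j) for j in range(n) if j != k]
    return gates

def apply(cols, n, gate):
    """Append one gate to the circuit encoded by the column bitmasks cols."""
    c = list(cols)
    if gate[0] == "H":
        k = gate[1]; c[k], c[n + k] = c[n + k], c[k]
    elif gate[0] == "P":
        k = gate[1]; c[n + k] ^= c[k]
    else:
        _, k, j = gate
        c[j] ^= c[k]; c[n + k] ^= c[n + j]
    return tuple(c)

def gate_counts(n):
    """Breadth-first search giving the optimal gate count of every matrix."""
    start = tuple(1 << i for i in range(2 * n))
    cost, frontier = {start: 0}, [start]
    while frontier:
        nxt = []
        for m in frontier:
            for g in all_gates(n):
                child = apply(m, n, g)
                if child not in cost:
                    cost[child] = cost[m] + 1
                    nxt.append(child)
        frontier = nxt
    return cost

def recover_circuit(m, n, cost):
    """Peel off the last gate repeatedly until the identity is reached."""
    circuit = []
    while cost[m] > 0:
        for g in all_gates(n):
            shorter = apply(m, n, g)       # undoes g if g was the last gate
            if cost[shorter] == cost[m] - 1:
                circuit.append(g)
                m = shorter
                break
    circuit.reverse()                      # gates were found last-to-first
    return circuit

cost = gate_counts(2)
target = max(cost, key=cost.get)           # one of the most expensive matrices
print(recover_circuit(target, 2, cost))
```

Each step needs only cost lookups for the neighbours of the current matrix, so the circuit itself never has to be stored in the database.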



IV. EXPERIMENTAL RESULTS 

In this section we describe the results of our search
together with the optimization experiments that rely on the
databases of the optimal circuits we found. For the
experiments that require more than 8 GB of RAM we used a high
performance server with eight Quad-Core AMD Opteron 8356
(2.30 GHz) processors and 128 GB of RAM. These are the
experiments involving 4- and 5-qubit Clifford unitaries. For
all other experiments we used a machine with a single
quad-core Intel Core i7-2600 (3.40 GHz) processor and 8 GB of
RAM.



A. Distribution of the optimal circuits 

We found optimal circuits for Clifford unitaries acting 
on 2-4 qubits (Figs. 1, 2) and optimal linear reversible
circuits acting on up to 6 qubits. In both cases we found 
both circuits with the optimal gate count and those with 
the minimal depth. For the case of Clifford unitaries we 
also found circuits with optimal number of Controlled-Z 
gates. 

Distributions reported in Figs. 1, 2 are interesting for
the randomized benchmarking of quantum information
processing systems. The benchmarking protocol [11] involves
the application of a large number of randomly chosen Clifford
unitaries. Knowledge of the distribution of the number of
gates allows one to estimate the average time required for
each experiment, and to evaluate its feasibility in view of,
e.g., the effects of decoherence. Using
optimized circuits minimizes the time required for an
experiment. In addition, computation of the normalized
quantities describing the quality of two-qubit gate
implementation (mentioned in the extended version of [5])
requires knowing the average number of two-qubit gates used,
which follows directly from our data.

TABLE II: The results of application of the peep-hole optimization to encoding circuits for quantum error correcting codes. [[n,k,d]] denotes the code that uses n physical qubits, encodes k logical qubits and has distance d; c_k - number of gates in the circuit obtained using Algorithm k; c_k^o - number of gates in the circuit after application of the peep-hole optimization using the database of 4-qubit optimal Clifford circuits; t_k - runtime of the peep-hole optimization software (in seconds) as applied to the circuits produced by Algorithm k.



B. Five qubit Clifford functions

The search for five-qubit unitaries up to input/output
order is challenging, but it is still tractable using modern
computers. The number of the different unitaries on five
qubits is about 2.4 × 10^17 (Table I). We need 100 bits to
store each group element. Factoring out simultaneous renaming
of inputs and outputs allows us to reduce the size of the
database by approximately 120 times. However, one still needs
3.08 × 10^6 GB to store the full database in this case. To
allow the search of any 5-qubit Clifford unitary up to
input/output order we allowed the independent renaming of the
inputs and outputs of the circuits and used the meet in the
middle [9] approach. We synthesized all 5-qubit unitaries that
use up to 11 gates, which allowed us to search for unitaries
that require up to 22 gates. The maximum number of gates
needed to implement any 5-qubit Clifford unitary is unknown.
We ran an experiment to estimate the distribution of the
number of gates required to implement a unitary. We used the
algorithm described in [12] to generate uniformly distributed
random Clifford unitaries and found their gate count. The
distribution of the number of gates for 5-qubit unitaries,
shown in Fig. 3, was obtained using 20,000 samples. We used
the Hoeffding inequality [13] to estimate errors for
confidence level 0.999. Based on the above calculation, we
believe that the 11-level database and the use of the meet in
the middle allow finding optimal circuits for any 5-qubit
Clifford unitary up to input/output order.



C. Peep-hole optimization

We used the database of the optimal 4-qubit Clifford
circuits to perform peep-hole optimization described in
detail in [5]. We applied it to encoding circuits for quantum
error correcting codes (QECCs). To obtain an encoding circuit
for a QECC one starts with the stabilizer generators of the
code and applies an algorithm that produces the encoding
circuit. We implemented two algorithms. The first one is a
version of the canonical decomposition theorem [3] for
stabilizers that produces layers of CNOT, H, and P gates
(Algorithm 1). The second one (Algorithm 2), taken from [15],
produces circuits that do not have an explicit layered
structure. Table II summarizes the results of our experiment
with codes from [16]. Applying peep-hole optimization to the
circuits produced by Algorithm 2 results in a reduction of
the number of gates by 45-53%.



D. Optimal encoding circuit for five-qubit 
quantum error correcting code 

Using a slightly modified version of our algorithm we
found a depth optimal circuit for the five-qubit [[5, 1, 3]]
error correcting code. This code encodes one qubit and
corrects any single qubit error. In this case only the first
four out of 10 rows of the binary symplectic matrix are
specified. We first found depth optimal circuits that produce
matrices with different first four rows. The problem has an
extra freedom: the addition of rows of the binary symplectic
matrix to each other does not change the code. In other
words, left multiplication of the specified part of the binary
symplectic matrix by a 4 x 4 invertible binary matrix leaves
the code unchanged. The search for all four-bit optimal linear
reversible circuits gave us a database of all 4 x 4 invertible
binary matrices. We used it to go through all matrices
equivalent to the one that defines the five-qubit code. The
depth and gate count optimal circuits found are shown in
Fig. 4. One of the best previously known circuits is
illustrated in Fig. 5. Our approach may also be used to
synthesize optimal circuits for other quantum error correcting
codes that use a small number of qubits.

FIG. 4: Optimal encoding circuits for the five-qubit code: (left) depth optimal circuit, depth=5; (right) circuit with the minimal number of gates, being 11 gates. Input marked as $|\psi\rangle$ corresponds to the state that should be encoded.

FIG. 5: Encoding circuit for the five-qubit code used in [2]. The two-qubit gate corresponds to $e^{i ZZ \pi/4}$. Eight of them are required to implement the encoding circuit.

V. CONCLUSIONS 

We explored the limitations of the brute force search
for optimal circuits implementing Clifford and linear
reversible unitaries. Using typical memory and processing
power available today, it is possible to search for up to
four-qubit optimal Clifford unitaries and six-qubit linear
reversible unitaries. We also demonstrated that additional
assumptions allow one to search for optimal five-qubit
Clifford unitaries up to input/output order. It is possible
to make further assumptions resulting in greater
sub-optimality, but reducing the size of the search space.
For example, one may allow Hadamard gates to be applied to
each output at the end of the circuit for free. This will
further reduce the size of the search space by a factor of
approximately $2^n$, where n is the number of qubits. It is
easy to come up with a canonical form computation for this
case. Of course, circuits produced by the algorithm will not
be exactly optimal. However, the results will be very close
to optimal if the cost of Hadamard gates is small. Using more
restricted gate sets, such as those that allow only nearest
neighbour or two nearest neighbour interactions, has the
opposite effect. In such a case we do not have the symmetry
between all qubits, which results in the growth of the search
space.

Using lookup in our database as a part of the peep-hole 
optimization shows that this is an efficient and promising 
approach for the optimization of larger Clifford circuits. 